Introduction to Open Data Science - Course Project

About the project

I first noticed this course on our doctoral student mailing list and thought it sounded interesting. I think Open Data and Open Science are both important aspects of good modern research, which is why I enrolled in this course to learn more about them. Additionally, I hadn’t used R or RStudio, let alone GitHub, before this course, so I thought I could learn to use these useful tools as well. So far the course has seemed very well organized, and I’m looking forward to learning more!

Reflecting on my learning experience

I thought that the R for Health Data Science book together with Exercise Set 1 was an excellent way to introduce us to the basics of RStudio quickly and clearly. I especially liked using the exercise set, because it gave a nice visual complement after first reading about the different tools and commands in the book. I don’t know if any particular topic was more difficult than the others; rather, the challenge was learning the logic of R and applying it for your own purposes.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec  6 00:45:24 2022"

Click here to see my GitHub repository.



Assignment 2

Describe the work you have done this week and summarize your learning.

date()
## [1] "Tue Dec  6 00:45:24 2022"

1. Part

The data I’m using in my analysis is derived from the larger “JYTOPKYS2” dataset. This dataset was collected with a survey of statistics students and was used to evaluate the effects of learning approaches on exam results.

Seven variables from the original dataset were chosen for this data (gender, age, attitude, deep learning, strategic learning, surface learning and exam points). Additionally, students who scored 0 in their exam were excluded from this data.

The variable attitude in learning2014 is the sum of 10 questions related to students’ attitude towards statistics, each measured on the Likert scale (1-5).

Variables deep (deep learning), stra (strategic learning) and surf (surface learning) in learning2014 are the mean values of the combinations of questions that relate to each learning approach. Each question was once again measured on the Likert scale (1-5).
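The mean-based construction described above can be sketched roughly as follows. Note that the column names D03–D08 are hypothetical placeholders, not the actual questionnaire item names:

```r
# a minimal sketch of how a learning-approach score could be built;
# the question names below are hypothetical placeholders, not the real items
toy <- data.frame(D03 = c(4, 2), D06 = c(3, 5), D07 = c(5, 1), D08 = c(4, 4))

# mean of the selected Likert items for each student
toy$deep <- rowMeans(toy[, c("D03", "D06", "D07", "D08")])
toy$deep
## [1] 4 3
```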

In the following R chunk I will explore the structure and dimensions of the data further.

# reading the data into memory
learning2014 <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/learning2014.csv", sep = ",", header = TRUE)

# displaying the first rows of the data frame with head()
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53       37 3.583333 3.375 2.583333     25
## 2      M  55       31 2.916667 2.750 3.166667     12
## 3      F  49       25 3.500000 3.625 2.250000     24
## 4      M  53       35 3.500000 3.125 2.250000     10
## 5      M  49       37 3.666667 3.625 2.833333     22
## 6      F  38       38 4.750000 3.625 2.416667     21
# the dataset has 166 rows and 7 columns
dim(learning2014)
## [1] 166   7
# there are 7 variables and 166 observations. One variable (gender) is a character string, while the other variables are integers or numeric values
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# a summary of the variables
summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :32.00   Median :3.667  
##                     Mean   :25.51   Mean   :31.43   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

2. Part

Here I present a graphical overview of the data and show summaries of the variables in the data.

# accessing the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# creating a plot matrix to give a graphical overview of the data
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# drawing the plot
p

# creating summaries of the variables
summary(factor(learning2014$gender))
##   F   M 
## 110  56
summary(learning2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00
summary(learning2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   26.00   32.00   31.43   37.00   50.00
summary(learning2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917
summary(learning2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000
summary(learning2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333
summary(learning2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

The data in the plot matrix is divided by gender, with females depicted in red and males in green. Looking at the distributions of the variables in the graphical overview, you quickly notice that the age variable is strongly skewed to the right. This is expected, as the study population consists of students. We also notice a slight left skew in deep learning (deep) and in the points variable. The attitude variable is slightly left-skewed in males, but almost normally distributed in females. Additionally, surface learning (surf) is right-skewed in males, but again closer to a normal distribution in females.

We can further appreciate the distributions and characteristics of individual variables by looking at the scatter and box plots of the plot matrix. For example, you could inspect whether a certain variable has many outliers in the box plot, or whether the spread is larger or differs between groups in the scatter plot.
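The outliers that a box plot flags can also be listed numerically rather than read off the plot; a small sketch, assuming learning2014 has been read in as above:

```r
# boxplot.stats() returns the values beyond 1.5 * IQR from the quartiles,
# i.e. exactly the points a box plot would draw as individual outliers
boxplot.stats(learning2014$age)$out
```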

The graphical overview also shows correlation coefficients between variables. The most notable of these is the strong positive correlation between attitude and points. Also worth noting is the strong negative correlation between surface learning (surf) and both attitude and deep learning (deep) in male students, which is absent in female students. There is also a significant negative correlation between surface learning (surf) and strategic learning (stra) when the students are analyzed as a whole.

Looking at the summaries of the variables, you notice that the median age is quite young (median = 22 years) and that there are almost twice as many female students as male students (110 females to 56 males).

3. Part

Here I create a regression model where exam points is the target variable.

# accessing the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# creating a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# creating new model without non-significant explanatory variable surf
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

My initial model had attitude, strategic learning and surface learning as explanatory variables, as they seemed to have the largest correlation coefficients with points in the plot matrix. Looking at the coefficients, we notice that attitude and strategic learning have positive estimates, while surface learning has a negative estimate. The t-statistic is the coefficient divided by its standard error; the larger its absolute value, the less likely it is that the true coefficient is 0, and thus the more likely that there is a relationship between the variables. The t-value of attitude is clearly above 0, meaning we can reject the null hypothesis (no relationship between the target variable and the explanatory variable) and declare a relationship between points and attitude. This is supported by the very small Pr(>|t|) of attitude, meaning there is a very small chance of seeing such a relationship due to chance. The same cannot be said about strategic and surface learning. While strategic learning does have a t-value above 0, the probability that this is due to chance is above the cut-off point (p < 0.05), as stra has a Pr(>|t|) value of 0.11716. Surface learning is even further from significance, with a t-value of -0.731 and a Pr(>|t|) value of 0.46563.
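The definition of the t-statistic can be verified directly from the coefficient table; a quick sketch, assuming my_model has been fitted as above:

```r
# recompute the t values as estimate / standard error and compare them
# to the "t value" column that summary() reports
coefs <- summary(my_model)$coefficients
all.equal(unname(coefs[, "Estimate"] / coefs[, "Std. Error"]),
          unname(coefs[, "t value"]))
## [1] TRUE
```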

After the first model, I removed the surf variable from the model since it didn’t have a significant relationship with the target variable points. The new model was fitted with only attitude and stra as explanatory variables.

4. Part

Next I use the summary of my new fitted model and explain the relationship between the chosen explanatory variables and the target variable.

# accessing the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)

# my new model without the non-significant explanatory variable surf and its summary
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

In my new model we notice that both attitude and strategic learning still have positive estimates (0.34658 and 0.91365, respectively). This means that for every point the attitude variable increases, exam points increase by 0.34658 on average (and by 0.91365 in the case of stra). The t-statistic of attitude is clearly above 0 (t-value: 6.132), meaning we can confidently declare a relationship between points and attitude. This is supported by the very small Pr(>|t|) of attitude (6.31e-09), meaning there is a very small chance of seeing such a relationship due to chance. Even though the t-statistic of stra is also clearly above 0 (t-value: 1.709) and its Pr(>|t|) is smaller than before (0.08927), it still falls short of the default cut-off point of p < 0.05. Therefore, we cannot say that there is a significant relationship between points and stra. Rather, there is suggestive evidence of a potential relationship, with approximately an 8.9% probability that the observed association is due to chance.
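The interpretation of the estimate as an average change per one-point increase can be illustrated with predict(); a sketch assuming my_model2 is fitted as above (the attitude and stra values are arbitrary examples):

```r
# two hypothetical students who differ only by one attitude point
newdata <- data.frame(attitude = c(30, 31), stra = c(3, 3))
preds <- predict(my_model2, newdata = newdata)

# the difference between the two predictions equals the attitude estimate
diff(preds)   # ~0.34658
```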

The R-squared statistic provides a measure of how well the model fits the data. A model with an R-squared value of 0 explains none of the variance in exam points, while a model with an R-squared value of 1 explains all of it. The square of the multiple correlation coefficient (multiple R-squared) of the model is 0.2048. As the multiple R-squared value increases with each additional explanatory variable, it is preferable to use the adjusted R-squared, which adjusts the value for the number of variables considered. The adjusted R-squared is 0.1951, i.e. these variables accounted for about 20% of the variation in the exam points. This is marginally more than in the previous model.

The residual standard error is a measure of the quality of the linear regression fit: it is the average amount by which the target variable points deviates from the true regression line. The smaller the residual standard error, the better the model fits the dataset. In this model the residual standard error is 5.289, which is slightly better than in the first model.

The omnibus F-test had a very low associated p-value (7.734e-09), so there is very strong evidence that at least one of the two regression coefficients differs from zero.
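These summary statistics can also be recomputed by hand from the residuals, which makes the formulas concrete (a sketch assuming my_model2 is fitted on learning2014 as above):

```r
n <- nrow(learning2014)   # 166 observations
k <- 2                    # number of explanatory variables
res <- residuals(my_model2)

# residual standard error: sqrt(RSS / (n - k - 1)), ~5.289 here
rse <- sqrt(sum(res^2) / (n - k - 1))

# R-squared = 1 - RSS/TSS, then adjusted for the number of predictors (~0.1951)
r2  <- 1 - sum(res^2) / sum((learning2014$points - mean(learning2014$points))^2)
adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
```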

5. Part

Finally I produce the following diagnostic plots of my model: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.

# divides the window into a 2x2 grid
par(mfrow = c(2,2))

# draws the diagnostic plots using the plot() function. The previously specified plots 1, 2 and 5 are selected
plot(my_model2, c(1,2,5))

Linear regression has four main assumptions: linear relationship, independence, homoscedasticity and normality.

We can use the residuals vs. fitted plot to evaluate whether there is correlation between the residuals and the fitted values. As we can see, the data points have a fairly random distribution around the 0 line, which supports the assumptions that the relationship is linear and the data is reasonably homoscedastic. There are a few outliers with large negative residuals, but these don’t appear to alter the overall variance in the data.

Using the normal Q-Q plot, we can evaluate the normality of the residuals and thus the fit of the model. The residuals of my model seem slightly left-skewed, but are fairly normally distributed overall.

Finally, we can use the residuals vs. leverage plot to assess linearity and influential points. Influential points are observations that, when removed from the dataset, could change the coefficients of the model when it is refitted. Here we can see that the spread is nicely linear, and no influential points are observed.
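Influence can also be screened numerically instead of only visually; a sketch assuming my_model2 from above, using Cook's distance (values near or above 1 are a common rule-of-thumb flag):

```r
# Cook's distance combines residual size and leverage into one influence measure
cd  <- cooks.distance(my_model2)
lev <- hatvalues(my_model2)

# list observations above the rule-of-thumb threshold (expected: none here)
which(cd > 1)

# leverages should stay well below 1; they always sum to the number of
# model parameters (here 3: intercept, attitude, stra)
summary(lev)
```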



Assignment 3 - Logistic regression

date()
## [1] "Tue Dec  6 00:45:28 2022"

Reading joined student alcohol consumption data

The data we are using in this assignment was acquired from the UCI Machine Learning Repository, Student Performance Data (incl. Alcohol consumption) page and later modified.

#install required packages if needed
#install.packages("tidyverse")
#install.packages("boot")
#install.packages("tidyr")
#install.packages("readr")
#install.packages("dplyr")
#install.packages("patchwork")

# access the readr and dplyr library
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#read the 'alc' data set into memory
alc <- read_csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/alc.csv", show_col_types=FALSE)

# look at the column names/names of the variables of the data set
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data was collected from students in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. As we can see, there are 35 variables in our data set. Our focus in this assignment is to evaluate the possible relationship between high/low alcohol use (high_use) and the other variables.

Choosing variables in relation to high/low alcohol consumption

I chose the student’s sex (sex), going out with friends (goout), number of past class failures (failures) and number of school absences (absences) as my variables of interest. My hypothesis is that male students have on average higher alcohol consumption than female students, as males also have higher alcohol consumption in the general population. I also hypothesize that students with more absences and class failures have higher alcohol consumption on average, as people with excessive alcohol use might miss work or school due to hangovers etc. Additionally, I hypothesize that going out with friends more often increases alcohol consumption, as alcohol tends to be a frequent component of young adults’ socializing.

Exploring the chosen variables

# access tidyr library
library(tidyr)

# produce summary statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_going_out = mean(goout), mean_failures = mean(failures), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count mean_going_out mean_failures mean_absences
##   <chr> <lgl>    <int>          <dbl>         <dbl>         <dbl>
## 1 F     FALSE      154           2.95         0.104          4.25
## 2 F     TRUE        41           3.39         0.293          6.85
## 3 M     FALSE      105           2.70         0.143          2.91
## 4 M     TRUE        70           3.93         0.386          6.1

Looking at our summarised table, we can see that considerably more males than females have high alcohol consumption (70 and 41, respectively), even though there are almost the same numbers of males and females in total (175 and 195, respectively). We can also note that students with high alcohol consumption go out with their friends more often than students without high alcohol consumption, regardless of sex. Additionally, the average numbers of class failures and school absences are higher for students with high alcohol consumption, again regardless of sex.

Next we will take a closer look at the distributions of failures and goout.

# access ggplot2 and patchwork libraries
library(ggplot2)
library(patchwork)
theme_set(theme_classic())

# closer look at the distribution of failures and goout
alc %>% count(failures)
## # A tibble: 4 × 2
##   failures     n
##      <dbl> <int>
## 1        0   325
## 2        1    24
## 3        2    17
## 4        3     4
alc %>% count(goout)
## # A tibble: 5 × 2
##   goout     n
##   <dbl> <int>
## 1     1    22
## 2     2    97
## 3     3   120
## 4     4    78
## 5     5    53
# create histograms for failures and goout
h1 <- alc %>% ggplot(aes(x = failures)) + geom_histogram(binwidth = 1) + scale_fill_grey()
h2 <- alc %>% ggplot(aes(x = goout)) + geom_histogram(binwidth = 1) + scale_fill_grey()

# draw histograms of failures and goout
h1 + h2

As we can see from the tables and histograms, class failures are heavily right-skewed, as the vast majority of students haven’t failed any class and only a handful have failed at least one (325 and 45, respectively). Going out with friends is almost normally distributed, although only a few students go out with their friends very rarely.

Next we will look at the chosen variables together with their relationship with alcohol consumption.

# define the plot as a bar plot
g1 <- alc %>% ggplot(aes(x = sex, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + theme(legend.position = "none") + scale_fill_grey()

# define the plot as a bar plot
g2 <- alc %>% ggplot(aes(x = failures, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + theme(legend.position = "none") + scale_fill_grey()

# define the plot as a box plot
g3 <- alc %>% ggplot(aes(x = high_use, y = absences)) + geom_boxplot() + xlab("high alcohol consumption")

# define the plot as a bar plot
g4 <- alc %>% ggplot(aes(x = goout, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + xlab("going out with friends") + scale_fill_grey(name = "high alcohol consumption")

# draw the plots
g1 + g2 + g3 + g4

Looking first at the sex variable, we notice that a higher proportion of males have high alcohol consumption compared to females, as we also noted previously.

Next looking at the failures variable, we notice that the proportion of students with high alcohol consumption increases with the number of failed classes, which suggests a positive correlation between them.

Next, looking at the absences variable, we notice that both the median and the spread of absences seem to be larger for students with high alcohol consumption. This might also point to a positive correlation between them.

Finally, looking at the goout variable, we notice that the proportion of students with high alcohol consumption increases with how often students go out with friends, similarly to the failures variable. This also suggests a positive correlation between them.

The findings in these plots seem to support my initial hypothesis regarding these chosen variables.

Creating a logistic regression model using the selected variables

Next I created a logistic regression model to predict high/low alcohol consumption. The target variable is high_use and the explanatory variables are failures, absences, sex and goout. Here I provide a summary of my model and interpret its coefficients as odds ratios.

# create logistic regression model using selected variables
m <- glm(high_use ~ failures + absences + sex + goout, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1521  -0.7929  -0.5317   0.7412   2.3706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14389    0.47881  -8.654  < 2e-16 ***
## failures     0.48932    0.23073   2.121 0.033938 *  
## absences     0.08246    0.02266   3.639 0.000274 ***
## sexM         0.97989    0.26156   3.746 0.000179 ***
## goout        0.69801    0.12093   5.772 7.83e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 369.50  on 365  degrees of freedom
## AIC: 379.5
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR) for the coefficients of the model by exponentiation
OR <- coef(m) %>% exp

# compute confidence intervals (CI) for the odds ratios by exponentiation
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.01586098 0.005950831 0.03905075
## failures    1.63121360 1.042981792 2.59042892
## absences    1.08595465 1.039999394 1.13814475
## sexM        2.66415000 1.605059602 4.48576247
## goout       2.00975350 1.594527456 2.56461657

All the explanatory variables have positive regression coefficients, which means that they increase the probability of high alcohol consumption. The z value is the regression coefficient divided by its standard error; a variable is likely significant when its absolute z value clearly exceeds the usual cut-off of 2.0 (which roughly corresponds to p < 0.05). All my chosen variables meet this cut-off: failures has a significant relationship with high_use, and the other chosen variables have very strongly significant relationships with high_use.
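As a quick illustration, the z value reported in the summary is just the estimate divided by its standard error; using the failures row from the model summary above:

```r
# z value = estimate / standard error (values taken from the model summary above)
estimate  <- 0.48932
std_error <- 0.23073
round(estimate / std_error, 3)   # 2.121, matching the summary output
```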

Looking at the odds ratios of my explanatory variables, we see that there is a strong positive association between the variables goout and sex and high_use, as both the OR and CI are clearly above 1.0. The variables failures and absences also have a positive association with high_use, although their increase in the odds of high alcohol consumption is not as large as for goout and sex.
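To make the odds-ratio interpretation concrete, here is a small sketch (with made-up coefficient values, not the fitted model) showing that a one-unit increase in an explanatory variable multiplies the odds by exp(coefficient):

```r
# sketch with assumed coefficients: the odds at x and x + 1 differ by a factor of exp(b1)
b0 <- -4.1   # assumed intercept, for illustration only
b1 <- 0.7    # assumed slope, similar in magnitude to 'goout'
odds <- function(x) exp(b0 + b1 * x)
odds(3) / odds(2)   # ratio of odds between x = 3 and x = 2
exp(b1)             # the odds ratio, identical by construction
```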

Next we explore the predictive power of my model.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   243   16
##    TRUE     61   50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65675676 0.04324324 0.70000000
##    TRUE  0.16486486 0.13513514 0.30000000
##    Sum   0.82162162 0.17837838 1.00000000
# computing the total proportion of inaccurately classified individuals ( = training error)
(16+61)/370
## [1] 0.2081081

From the cross tabulation we can see that our model is fairly good at predicting low consumption of alcohol, but not as good when alcohol consumption is high (243/259 and 50/111, respectively). This is made clearer by looking at the point plot.

The training error of my model was about 0.21.

Assuming a person simply guesses at random whether a student’s alcohol consumption is high or low, the training error of this guessing would approach 0.50 (flipping a coin). My model therefore classifies roughly 29 percentage points more of the students correctly (0.50 − 0.21) than simple guessing.
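A small simulation sketch of the coin-flip baseline (the counts of 111 high-use and 259 low-use students come from the cross tabulation above; the seed is an arbitrary choice):

```r
# simulate random guessing for all 370 students and measure the error rate
set.seed(2022)
actual  <- rep(c(TRUE, FALSE), times = c(111, 259))       # 111 high-use, 259 low-use
guesses <- sample(c(TRUE, FALSE), size = 370, replace = TRUE)
mean(guesses != actual)   # close to 0.5, far worse than the model's 0.21
```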

Bonus: Performing 10-fold cross validation of my model

As an additional task we were asked to perform a 10-fold cross validation of our models.

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2081081

As we can see, my model has a prediction error of about 0.21, which is better than the model introduced in the Exercise Set 3 (prediction error ≈ 0.26).



Assignment 4 - Clustering and classification

date()
## [1] "Tue Dec  6 00:45:29 2022"
# install.packages(c("MASS", "corrplot", "tidyverse", "reshape2")) if needed

Exploring our data

In this assignment we look at the Boston data set from the MASS library. It describes housing values in the suburbs of Boston together with various other factors related to the housing and its surroundings. The data frame has 14 variables and 506 observations (14 columns and 506 rows). Here I show the structure of the data frame and a short summary of the variables in question.

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
# load the Boston data
data("Boston")

# explore the dataset
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Graphical overview of data

Here I visualize the distribution of the variables and the relationships between them.

# access the required libraries
library(tidyr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
# convert data from wide to long
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
# visualize variables with small multiple chart
g1 <- ggplot(data = melt.boston, aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")
g1

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(2)

# test the significance of the correlations (on the original data, not the correlation matrix)
testRes = cor.mtest(Boston, conf.level = 0.95)

# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.col = "black", lower.col = "black", number.cex = 0.7, tl.cex=0.7)

Looking at the previous summary of the variables and the visualization, we can get a good understanding of the data. The crime rate crim, proportion of residential land zoned zn, distance to employment centers dis and lower status of the population lstat are all clearly right-skewed. On the other hand, the proportion of blacks black, proportion of units built before 1940 age and pupil-teacher ratio ptratio are all clearly left-skewed. The proportion of non-retail business acres indus, accessibility to highways rad and property-tax rate tax have bimodal distributions (the Charles River dummy variable chas is binary). Looking at the correlation matrix, the highest positive correlations are between tax and rad, indus and nox (nitrogen oxide concentration), age and nox, and rm (average number of rooms) and medv (median value). The highest negative correlations are between dis and nox, dis and age, medv and lstat, and dis and indus.

Standardizing the data

To standardize the data, we scale it: each variable’s mean is subtracted from its values, and the result is divided by the variable’s standard deviation.

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# print summary of scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling the data, we notice that the mean of every variable is 0 and, by construction, the standard deviation of every variable is 1. This puts the variables on a comparable scale, although their ranges still differ somewhat, as the extreme values in the summary show.
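A minimal check of what scale() does to a single variable:

```r
# scale() subtracts the mean and divides by the standard deviation, column by column
x <- c(2, 4, 6, 10)
x_scaled <- as.vector(scale(x))
round(mean(x_scaled), 10)   # 0
round(sd(x_scaled), 10)     # 1
```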

Next we create a categorical variable of the crime rate in the scaled data set. After this, we divide the data set to train and test sets (80% of the data belongs to the train set).

#access the dplyr library
library(dplyr)

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Fitting the linear discriminant analysis

Here we fit a linear discriminant analysis (LDA) on the train set, using the categorical crime rate crime as the target variable and all the other variables in the data set as predictor variables. We then visualize this analysis with a biplot.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

## Predicting classes using the LDA model

Next we first save the crime categories from the test set and then remove the categorical crime variable from the test data set. After this, we predict the classes with the LDA model on the test data. Finally, we cross tabulate the results with the crime categories from the test set.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      15        2    0
##   med_low    3      19        5    0
##   med_high   0       8       17    0
##   high       0       0        1   20

Looking at the cross tabulation, we can see that the overall prediction rate of our model is approximately 67% (correct predictions divided by all predictions, 68/102), which is fairly good. We notice that our model is really good at predicting high crime rate, while not as good at predicting the other categories.
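As a sanity check, the overall prediction rate can be computed directly from the cross tabulation above (correct predictions sit on the diagonal of the confusion matrix):

```r
# rebuild the confusion matrix from the tabulated values (filled column by column)
conf <- matrix(c(12,  3,  0,  0,
                 15, 19,  8,  0,
                  2,  5, 17,  1,
                  0,  0,  0, 20), nrow = 4)
sum(diag(conf)) / sum(conf)   # 68/102, about 0.667
```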

Measuring distance and clustering

Next we reload the Boston data set and standardize it. Then we calculate the distance between observations.

library(MASS)
data("Boston")

# center and standardize variables as done previously
boston_scaled2 <- scale(Boston)

# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled2)

# create the euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
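As a reminder of what dist() computes by default, here is the Euclidean distance between two toy observations, both by hand and with dist():

```r
# Euclidean distance: square root of the sum of squared coordinate differences
a <- c(1, 2)
b <- c(4, 6)
sqrt(sum((a - b)^2))          # 5
as.vector(dist(rbind(a, b)))  # also 5
```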

Next we run the k-means algorithm on the data set, determine the optimal number of clusters, and visualize the resulting clustering.

# Use seed to mitigate the effect of random number generators
set.seed(1234)

# k-means clustering (on the scaled data, so that large-scale variables do not dominate the distances)
km <- kmeans(boston_scaled2, centers = 3)

# plot the scaled data set with clusters
pairs(boston_scaled2, col = km$cluster)

# determine the number of clusters for optimization
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

# k-means clustering with optimal number of clusters
km2 <- kmeans(boston_scaled2, centers = 2)

# plot the scaled dataset with clusters
pairs(boston_scaled2, col = km2$cluster)

The k-means algorithm was initially run with three clusters. Visualizing how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes, we can see that the total WCSS drops radically at two clusters, meaning this is our optimal number of clusters. We then performed the k-means algorithm again with the optimal number of clusters and visualized it. Looking at our plot with two clusters, we can see that overall the clusters are fairly well separated within the variables, so the clustering seems to work properly. Comparing the final plot to the original k-means run with three clusters, there is more overlap between the clusters, which makes it harder to interpret.



Assignment 5 - Dimensionality reduction techniques

date()
## [1] "Tue Dec  6 00:45:35 2022"

Exploring the data

Here we look at the variables of the human data, that we made previously.

#read the 'human' data set into memory
human <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/human.csv", row.names = 1)

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#access GGally library
library(GGally)

#visualize the 'human' variables
ggpairs(human, progress = FALSE)

#access corrplot library
library(corrplot)

#compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()

As we look at the summaries of the variables and the pairs plot, we can see that Gross National Income per capita (GNI), maternal mortality ratio (Mat.Mor) and adolescent birth rate (Ado.Birth) all have heavily right-skewed distributions. On the other hand, both life expectancy (Life.Exp) and the ratio of labor force participation of females to males (Labo.FM) have left-skewed distributions. The distribution of the ratio of female to male populations with secondary education (Edu2.FM) has positive kurtosis.

Looking at the pairs plot and correlation plot we notice that there are a few significant correlations between our variables. There is a high positive correlation between Ado.Birth and Mat.Mor (0.759). There is also a high negative correlation between Life.Exp and both Mat.Mor (-0.857) and Ado.Birth (-0.729). On the other hand, there is a high positive correlation between Life.Exp and expected years of schooling (Edu.Exp) (0.789). Lastly, there is a positive correlation between Gross National Income per capita (GNI) and both Life.Exp (0.627) and Edu.Exp (0.624).

Principal component analysis

Next we perform the principal component analysis (PCA) on the raw (non-standardized) human data.

#perform principal component analysis (with the singular value decomposition method)
pca_human <- prcomp(human)

#create a summary of pca_human
s <- summary(pca_human)

#rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 2)

#print out the percentages of variance
pca_pr
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
#create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

#draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("navyblue", "firebrick1"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
*PCA with raw data: PC1 explains almost all of the variance. GNI has a heavy negative loading on PC1, while the other variables do not seem to have a significant loading on either PC.*


Next we standardize the variables and repeat the above analysis.

#standardize the variables
human_std <- scale(human)

#perform principal component analysis on the standardized variables (with the SVD method)
pca_human_std <- prcomp(human_std)

#create and print out a summary of pca_human_std
s2 <- summary(pca_human_std)

#rounded percentanges of variance captured by each PC
pca_pr2 <- round(100*s2$importance[2, ], digits = 2)

#print out the percentages of variance
pca_pr2
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 53.61 16.24  9.57  7.58  5.48  3.60  2.63  1.30
# create object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")

#draw a biplot of the principal component representation and the standardized variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("navyblue", "firebrick1"), xlab = pc_lab2[1], ylab = pc_lab2[2])
*PCA with standardized data: PC1 explains 53.61% of the variance, while PC2 explains 16.24%. Mat.Mor and Ado.Birth have both positive loading on PC1, while Edu2.FM, GNI, Edu.Exp and Life.Exp have negative loading on PC1. Parli.F and Labo.FM have both a positive loading on PC2. No variables seem to have a significant negative loading on PC2.*


As we can see, the results of the two PCAs are very different. In the first PCA, using the non-standardized data, the first principal component (PC1) explains almost all of the variance (99.99%) and the second principal component (PC2) explains only 0.01%. After standardization, PC1 explains 53.61% and PC2 16.24% of the variance. This difference is due to the different scales of the raw variables. Looking at the previous section, we can see that GNI has a much larger scale than the other variables. This is evident in the first biplot, where the only visible arrow is GNI, showing that the first PCA has loaded heavily on the GNI variable. Once the variables are standardized, their scales are comparable and the PCA is no longer driven by differences in variable scale.
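A toy sketch of this scaling effect (with simulated data, not the human data set): when one of two independent variables has a much larger variance, unscaled PCA loads almost entirely on it, while scaling spreads the variance across components:

```r
# two independent variables, one on a vastly larger scale
set.seed(1)
d <- data.frame(small = rnorm(100), large = rnorm(100, sd = 1000))

# proportion of variance per PC, without and with standardization
summary(prcomp(d))$importance[2, ]                 # PC1 captures nearly 100%
summary(prcomp(d, scale. = TRUE))$importance[2, ]  # variance split roughly evenly
```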

My own interpretation of the first two principal component dimensions is that PC1 relates to how developed a country is. The reasoning comes from the variables loading on PC1: Mat.Mor, Ado.Birth, Edu2.FM, GNI, Edu.Exp and Life.Exp, all of which are key factors when assessing a country’s stage of development. This is further supported by the biplot, where richer and more developed countries (e.g. many European nations) are on the left side while many poorer countries (e.g. many African nations) are on the right side. PC2 seems to be related to female participation in society, as the variables loading on it were Parli.F and Labo.FM. We can also notice that the countries furthest away from these variables are fairly male-centric (e.g. Arab nations).

Multiple Correspondence Analysis

Next we will look at the tea data. It comes from the FactoMineR package and was collected with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions), what their perception of the product is (12 questions), and some personal details (4 questions). We will look at the dimensions and structure of the data and finally visualize it.

#read tea data set into memory and convert character variables to factors
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

#look at the dimension and structure of the data
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#browse the contents of the data frame
View(tea)

#access dplyr library
library(dplyr)

#selecting columns to keep in the data set for visualization and MCA
keep_columns <- c("Tea", "How", "sugar", "how", "sex", "age_Q", "frequency")

#select the 'keep_columns' to create a new data set
tea_time <- dplyr::select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#access the ggplot and tidyr library
library(ggplot2)
library(tidyr)

#visualize the data set
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) + facet_wrap("name", scales = "free")

Looking at the structure of the tea data frame, we see that there are 36 variables and 300 observations. Almost all of the variables are factors, except age, which is an integer. For the visualization and upcoming analysis I chose seven variables that seemed interesting to me. From the visualization we can see that Earl Grey is the most popular tea variety, most people drink their tea alone (without any additions), most drink tea at least once per day, and most use tea bags. Use of sugar is fairly evenly split. There were slightly more females than males among the respondents, and the largest age group was 15 to 24.

Next, we perform multiple correspondence analysis (MCA) on the selected variables.

#multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)

#summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.246   0.216   0.186   0.179   0.168   0.159   0.149
## % of var.             10.752   9.435   8.137   7.841   7.347   6.947   6.534
## Cumulative % of var.  10.752  20.187  28.324  36.165  43.512  50.459  56.993
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.145   0.130   0.123   0.122   0.113   0.101   0.095
## % of var.              6.358   5.701   5.396   5.358   4.933   4.419   4.171
## Cumulative % of var.  63.351  69.052  74.448  79.807  84.740  89.158  93.329
##                       Dim.15  Dim.16
## Variance               0.078   0.074
## % of var.              3.430   3.241
## Cumulative % of var.  96.759 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## 1         |  0.015  0.000  0.000 |  0.558  0.481  0.140 |  0.396  0.282  0.071
## 2         |  0.611  0.507  0.171 |  0.049  0.004  0.001 |  0.020  0.001  0.000
## 3         |  0.366  0.181  0.107 | -0.559  0.483  0.250 | -0.331  0.197  0.088
## 4         | -0.752  0.768  0.450 |  0.019  0.001  0.000 |  0.147  0.039  0.017
## 5         |  0.241  0.079  0.043 | -0.109  0.018  0.009 | -0.326  0.190  0.078
## 6         | -0.376  0.192  0.114 | -0.123  0.023  0.012 | -0.073  0.009  0.004
## 7         |  0.096  0.012  0.003 |  0.144  0.032  0.008 |  0.082  0.012  0.003
## 8         |  0.449  0.273  0.065 |  0.155  0.037  0.008 |  0.664  0.791  0.143
## 9         |  0.257  0.089  0.028 |  0.014  0.000  0.000 |  0.050  0.004  0.001
## 10        |  0.675  0.619  0.199 |  0.096  0.014  0.004 |  0.001  0.000  0.000
##            
## 1         |
## 2         |
## 3         |
## 4         |
## 5         |
## 6         |
## 7         |
## 8         |
## 9         |
## 10        |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
## black     |   1.056  15.999   0.365  10.452 |   0.459   3.448   0.069   4.545 |
## Earl Grey |  -0.457   7.812   0.377 -10.614 |  -0.293   3.658   0.155  -6.804 |
## green     |   0.304   0.592   0.011   1.850 |   0.684   3.405   0.058   4.155 |
## alone     |  -0.035   0.047   0.002  -0.831 |  -0.180   1.389   0.060  -4.232 |
## lemon     |  -0.330   0.698   0.013  -2.009 |   0.487   1.725   0.029   2.958 |
## milk      |   0.025   0.008   0.000   0.227 |   0.306   1.306   0.025   2.731 |
## other     |   1.798   5.635   0.100   5.466 |  -0.037   0.003   0.000  -0.114 |
## No.sugar  |   0.631  11.946   0.425  11.275 |  -0.223   1.704   0.053  -3.990 |
## sugar     |  -0.674  12.770   0.425 -11.275 |   0.239   1.822   0.053   3.990 |
## tea bag   |  -0.192   1.218   0.048  -3.802 |  -0.036   0.048   0.002  -0.708 |
##             Dim.3     ctr    cos2  v.test  
## black       0.537   5.456   0.094   5.310 |
## Earl Grey  -0.010   0.005   0.000  -0.225 |
## green      -1.147  11.110   0.163  -6.971 |
## alone      -0.318   5.037   0.187  -7.485 |
## lemon       0.412   1.434   0.021   2.505 |
## milk        0.375   2.267   0.037   3.342 |
## other       2.747  17.383   0.233   8.352 |
## No.sugar   -0.320   4.076   0.110  -5.730 |
## sugar       0.343   4.357   0.110   5.730 |
## tea bag     0.334   4.859   0.146   6.607 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## Tea       | 0.420 0.159 0.216 |
## How       | 0.110 0.067 0.340 |
## sugar     | 0.425 0.053 0.110 |
## how       | 0.084 0.171 0.150 |
## sex       | 0.045 0.517 0.000 |
## age_Q     | 0.478 0.403 0.421 |
## frequency | 0.159 0.140 0.065 |
#visualize MCA with the variable biplot
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

Looking at the variable biplot, we can see that the first two dimensions of the MCA capture only 10.75% and 9.44% of the total variance, so the plot should be interpreted with some caution. The horizontal dimension separates the “other” tea drinkers from the remaining options. The vertical dimension separates males from females and unpackaged tea from the other packaging options; it also appears to separate Earl Grey from the other tea varieties. The age groups are linked to both dimensions. Based on how the categories are spread, we might cautiously infer that older men are more likely than young women to drink unpackaged green or black tea.
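To complement the visual reading of the biplot, the FactoMineR function dimdesc() lists the variables and categories significantly associated with each dimension. A short sketch, assuming the mca object fitted above:

```r
#describe the MCA dimensions: which variables and categories
#are most strongly associated with each axis
dd <- dimdesc(mca, axes = 1:2)

#variables and categories linked to the first dimension
dd$`Dim 1`
```

This gives a more formal basis for statements like the first dimension opposing “other” tea drinkers to the rest, instead of relying on the plot alone.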

(more chapters to be added similarly as we proceed with the course!)